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Abstract 

Given i.i.d. data from an unknown distribution, we 
consider the problem of predicting future items. An 
adaptive way to estimate the probability density is 
to recursively subdivide the domain to an appropriate 
data-dependent granularity. A Bayesian would assign 
a data-independent prior probability to "subdivide" , 
which leads to a prior over infinitely many) trees. We 
derive an exact, fast, and simple inference algorithm 
for such a prior, for the data evidence, the predictive 
distribution, the effective model dimension, and other 
quantities. 

1 INTRODUCTION 

Inference. We consider the problem of inference from 
i.i.d. data D, in particular of the unknown distribution 
q the data is sampled from. In case of a continuous 
domain this means inferring a probability density from 
data. Without structural assumption on q, this is hard 
to impossible, since a finite amount of data is never 
sufficient to uniquely select a density (model) from an 
infinite-dimensional space of densities (model class) . 

Methods. In parametric estimation one assumes that 
q belongs to a finite-dimensional family. The two- 
dimensional family of Gaussians characterized by mean 
and variance is prototypical. The maximum likelihood 
(ML) estimate of q is the distribution that maximizes 
the data likelihood. Maximum likelihood overfits if 
the family is too large and especially if it is infinite- 
dimensional. A remedy is to penalize complex distri- 
butions by assigning a prior (2nd order) probability 
to the densities q. Maximizing the model posterior 
(MAP), which is proportional to likelihood times the 
prior, prevents overfitting. Bayesians keep the com- 
plete posterior for inference. Typically, summaries like 



the mean and variance of the posterior are reported. 

How to choose the prior? In finite or small compact 
low-dimensional spaces a uniform prior often works 
(MAP reduces to ML). In the non-parametric case 
one typically devises a hierarchy of finite-dimensional 
model classes of increasing dimension. Selecting the 
dimension with maximal posterior often works well 
due to the Bayes factor phenomenon [Goo83, Jay03, 
Mac03]: In case the true model is low-dimensional, 
higher-dimensional (complex) model classes are auto- 
matically penalized, since they contain fewer "good" 
models. Full Bayesians would assign a prior probabil- 
ity (e.g. jk) to dimension d and mix over dimension. 

Interval Bins. The probably simplest and oldest 
model for an interval domain is to divide the interval 
(uniformly) into bins, assume a constant distribution 
within each bin, and take a frequency estimate for the 
probability in each bin, or a Dirichlet posterior if you 
are a Bayesian. There are heuristics for choosing the 
number of bins as a function of the data size. The 
simplicity and easy computability of the bin model is 
very appealing to practitioners. Drawbacks are that 
distributions are discontinuous, its restriction to one 
dimension (or at most low dimension: curse of dimen- 
sionality), the uniform (or more generally fixed) dis- 
cretization, and the heuristic choice of the number of 
bins. We present a full Bayesian solution to these prob- 
lems, except for the non-continuity problem. Polya 
trees [Lav94] inspired our model. 

More advanced model classes. There are plenty 
of alternative Bayesian models that overcome some or 
all of the limitations. Examples are continuous Dirich- 
let process (mixtures) [Fer73], Bernstein polynomials 
[PW02], Bayesian field theory [Lem03], Bayesian ker- 
nel density estimation or other mixture models [EW95], 
or universal priors [Hut04b], but analytical solutions 



1 



are infcasiblc. Markov Chain Monte Carlo sampling 
or Expectation Maximization algorithms [DLR77] or 
variational methods can often be used to obtain ap- 
proximate numerical solutions, but computation time 
and global convergence remain critical issues. Prac- 
titioners usually use (with success) efficient MAP or 
M(D)L or heuristic methods, e.g. kernel density esti- 
mation [GM03], but note that MAP or MDL can fail, 
while Bayes works [PH04] . 

Our tree mixture model. The idea of the model 
class discussed in this paper is very simple: With equal 
probability, we chose q either uniform or split the do- 
main in two parts (of equal volume), and assign a prior 
to each part, recursively, i.e. in each part again either 
uniform or split. For finitely many splits, q is a piece- 
wise constant function, for infinitely many splits it is 
virtually any distribution. While the prior over q is 
neutral about uniform versus split, we will see that the 
posterior favors a split if and only if the data clearly in- 
dicates non-uniformity. The method is a full Bayesian 
non-heuristic tree approach to adaptive binning for 
which we present a very simple and fast algorithm for 
computing all(?) quantities of interest. 

Contents. In Section 2 we introduce our model and 
compare it to Polya trees. We also discuss some ex- 
ample domains, like intervals, strings, volumes, and 
classification tasks. In Section 3 we present recursions 
for various quantities of interest, including the data ev- 
idence, the predictive distribution, the effective model 
dimension, the tree size and height, and cell volume. 
We discuss the qualitative behavior and state conver- 
gence of the posterior for finite trees. The proper case 
of infinite trees is discussed in Section 4, where we an- 
alytically solve the infinite recursion at the data sepa- 
ration level. Section 5 collects everything together and 
presents the algorithm. We also numerically illustrate 
the behavior of our model on one example distribu- 
tion. Section 6 contains a brief summary, conclusions, 
and outlook, including natural generalizations of our 
model. Sec [Hut04a] for derivations, proofs, program 
code, extensions, and more details. 

2 THE TREE MIXTURE MODEL 

Setup and basic quantities of interest. We are 
given i.i.d. data D = (x 1 ,...,x n ) G T n of size n from 
domain T, e.g. T C M d sampled from some unknown 
probability density q : T — > M. Standard inference 
problems are to estimate q from D or to predict the 
next data item x n+1 e T. By definition, the (objec- 
tive or aleatoric) data likelihood density under model 
q is p{D\q) = q{xi)-...-q(x n ). Note that we consider 
sorted data, which avoids annoying multinomial co- 
efficients. Otherwise this has no consequences. Re- 
sults are independent of the order and depend on the 



counts only, as they should. A Bayesian assumes a 
(belief or 2™ d -order or epistemic or subjective) prior 
p(q) over models q in some model class Q. The data 
evidence is p(D) = j^p{D\q)p{q)dq. Having the evi- 
dence, Bayes' famous rule allows to compute the (be- 
lief or 2" d -order or epistemic or subjective) posterior 
p(q\D)—p(D\q)p(q)/p(D) of q. The predictive or pos- 
terior distribution of x is p(x\D) =p(D,x)/p(D), i.e. 
the conditional probability that the next data item is 
x = x n+1 , given D, follows from the evidences of D 
and (D,x). Since the posterior of q is a complex ob- 
ject, we need summaries like the expected ^-probability 
of x and (co)variances. Fortunately they can also be 
reduced to computation of evidences: £ l [g(a;)|D] := 
Jq(x)p(q\D)dq—p(x\D). In the last equality we used 
the formulas for the posterior, the likelihood, the ev- 
idence, and the predictive distribution, in this order. 
Similarly for the covariance. We derive and discuss 
further summaries of q for our particular tree model, 
like the model complexity or effective dimension, and 
the tree height or cell size, later. 

Hierarchical tree partitioning. Up to now every- 
thing has been fairly general. We now introduce the 
tree representation of domain T. We partition T into 
r and Ti, i.e. r = r UFi and r o nri=0. Recursively 
we (sub)partition Y z — T z0 (jT z i for z <E JB™, where 
— UI!lfc{0,l} J ls the set of all binary strings of 
length between k and m, and r e = r, where e = {0,l}° 
is the empty string. We are interested in an infinite 
recursion, but for convenience we assume a finite tree 
height m < oo and consider m — > oo later. Also let 
l:=£(z) be the length of string z = Z\...zi =: z\-i, an d 
|r z | the volume or length or cardinality of T z . 

Example spaces. Intervals: Assume r=[0,l) is the 
unit interval, recursively bisected into intervals T z = 
[0.z,0.z+2- ; ) of length |r z | = 2~', where Q.z is the real 
number in [0,1) with binary expansion z\...Z{. 

Strings: Assume T z — {zy.ye{0,l} m ^ 1 } is the set of 
strings of length m starting with z. Then r = {0,l} m 
and |r z | — 2 m ^ 1 . For to = oo this set is continuous, for 
m<oo finite. 

Trees: Let T be a complete binary tree of height m 
and T z0 (F zl ) be the left (right) subtree of T z . If \T Z \ 
is defined as one more than the number of nodes in T z , 
then \T z \ = 2 m+1 - 1 . 

Volumes: Consider T C M d , e.g. the hypercube T = 
[0,l) d . We recursively halve T z with a hyperplane 
orthogonal to dimension (I mod d) + l, i.e. we sweep 
through all orthogonal directions. |I\ s | = 2 _i |r|. 

Compactification: We can compactify TC (l,oo] (this 
includes r = JV\{l}) to the unit interval T':={±:xe 
r}C [0,1), and similarly VCR (this includes T = Z) to 
T' := {x e [0,1) : J?(lZx) € T}. All reasonable spaces can 
be reduced to one of the spaces described above. 
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Classification: Consider an observation o € V (e.g. 
email) that is classified as c€ {0,1} (e.g. good versus 
spam), where T' could be one of the spaces above (e.g. 
o is a sequence of binary features in decreasing order 
of importance). Then x := (o,c) e T := V x {0,1} and 
T 0z = T' z x {0} and T lz = T' z x {1}. Given D (e.g. pre- 
classified emails), a new observation o is classified as 
c with probability p(c\D,o)ocp(D,x). Similar for more 
than two classes. 

In all these examples (we have chosen) |F z o| = |T z i| = 
i|r z | Vz <G JBq 1 ^ 1 , and this is the only property we 
need and henceforth assume. W.l.g. we assume/define/ 
rescale |T| = 1. Generalizations to non-binary and non- 
symmetric partitions are straightforward and briefly 
discussed at the end. 

Identification. We assume that {T z :z^lB™} are (ba- 
sis) events that generate our a- algebra. For every xeT 
let x' be the string of length i(x')=m such that xET x r. 
We assume that distributions q are cr-measurable, i.e. 
to be constant on T x > W <G lB m . For m = oo this as- 
sumption is vacuous; we get all Borel measures. Hence, 
we can identify the continuous sample space T with the 
(for m < oo discrete) space JB m of binary sequences of 
length m, i.e. in a sense all example spaces are isomor- 
phic. While we have the volume model in mind for 
real-world applications, the string model will be con- 
venient for mathematical notation, the tree metaphor 
will be convenient in discussion, and the interval model 
will be easiest to implement and to present graphically. 

Notation. As described above, T may also be a tree. 
This interpretation suggests the following scheme for 
defining the probability of q on the leaves x' . The 
probability of the left child node zO, given we are in 
the parent node z, is P[r z0 |r z ,(7], so we have 

p(x\T z ,q) =p(x\r z0 ,q)-P[T z0 \T z ,q} if x g T z q 

and similarly for the right child. In the following we 
often have to consider distributions conditioned to and 
in the subtree T z , so the following notation will turn 
out convenient 

q z0 --=P[T z0 \T z ,q], Pz (x\...) := 2- l p(x\T z ...) (1) 

m 

^>p z (x\q) = 2q ZXl+lPzXl+1 (x\q) = ...= Y[ 2g Xl:i if x e T z 

i=l+l 

where we have used that p(x\T x r,q) — |r x /| _1 = 2 m is 
uniform. Note that q z o + q z i = 1- Finally, let q z * :— 
(q zy :ye]B™- 1 ) be the (2 m -' +1 -2)-dimensional vector 
or ordered set or tree of all reals q zy £ [0,1] in subtree 
r z . Note that q z £ q„. The (non) density q z (x) := 
p z (x\q) depends on all and only these q zy . For z^e, 
q z () and p z () are only proportional to a density due 
to the factor 2~ l , which has been introduced to make 
p x i{x\. ..) = !. (They are densities w.r.t. 2'Air^, where 



A is the Lebesgue measure.) We have to keep this in 
mind in our derivations, but can ignore this widely in 
our discussion. 

Polya trees. In the Polya tree model one assumes 
that the q z o = l — q z i are independent and Beta(-,-) dis- 
tributed, which defines the prior over q. Polya trees 
form a conjugate prior class, since the posterior is also 
a Polya tree, with empirical counts added to the Beta 
parameters. If the same Beta is chosen in each node, 
the posterior of x is pathological for m^oo: The dis- 
tribution is everywhere discontinuous with probability 
1. A cure is to increase the Beta parameters with I, 
e.g. quadratically, but this results in "undcrfitting" for 
large sample sizes, since Beta(large, large) is too infor- 
mative and strongly favors q Z Q near i. It also violates 
scale invariance, which should hold in the absence of 
prior knowledge. That is, the p(oste)rior in r o = [0,i) 
should be the same as for T = [0,1) (after rescaling all 
x^x/2 in D). 

The new tree mixture model. The prior P[q] 
follows from specifying a prior over q*, since q(x) cx 
q Xl •■■■•Qxi-.m by (1). The distribution in each subset 
r 2 CT shall be either uniform or non-uniform. A nec- 
essary (but not sufficient) condition for uniformity is 

<7z0 = <Zzl = 2 • 

P u (q z0 ,q z i) ■= S(q z0 - \)8{q zl - \), (2) 

where i5() is the Dirac delta. To get uniformity on T z 
we have to recurse the tree down in this way. 

P u (q z *) ■= P u (q z o,q z i)p u (q z o*)p u (q z u) (3) 

with the natural recursion termination p u (q z *) = 1 
when £(z) = m, since then q„ — cf). For a non- 
uniform distribution on T z we allow any probabil- 
ity split q(T z ) = q(T z0 ) + q(T z0 ), or equivalently 1 = 
q z o + q z i- We assume a uniform prior on the split, i.e. 

P s {q z o,q z i) ■= S(q z0 + q zl -l) (4) 

We now recurse down the tree 

p s (q z *) ■= p s (q z o,q z i)p(q z o*)p(q z i*) (5) 

again with the natural recursion termination p(q z *) = 
p((/)) = 1 when £(z) = m. Finally we have to mix the 
uniform with the non-uniform case. 

P(q z *) ■■= P(u)p u (q z *)+p(s)p s (q z *) (6) 

We choose a 50/50 mixture p(u) =p(s) = |. This com- 
pletes the specification of the prior P [q]=p(q*)- 

For example, if the first bit in x is a class label and 
the remaining are binary features in decreasing order of 
importance, then given class and features z = X\ : i, fur- 
ther features xi+i :m could be relevant for classification 
(q z (x) is non- uniform) or irrelevant (q z (x) is uniform). 
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Comparison to the Polya tree. Note the important 
difference in the recursions (3) and (5). Once we de- 
cided on a uniform distribution (2) we have to equally 
split probabilities down the recursion to the end, i.e. 
we recurse in (3) with p u , rather than the mixture p 
(this actually allows to solve the recursion). On the 
other hand if we decided on a non- uniform split (4), 
the left and right partition each itself may be uniform 
or not, i.e. we recurse in (5) with the mixture p, rather 
thanp s . Inserting (4) in (5) in (6) and recursively (2) 
in (3) in (6) we get 

P(<fz*) = in*^" - ^ + h S (9zO+Qzi-^)p(QzO*)p(Qzi*) 
yen?-' (7) 
Choosing p(u) = would lead to the Polya tree model 
(and its problems) with q z o^ Beta(l,l). For our choice 
(p(u) = 5), but with p instead of p u on the r.h.s. of 
(3) we would get a quasi-Polya model (same problems) 
with <7 2 o<~5[Beta(oo,oo)-|-Beta(l,l)]. 

For m^oo, our model is scale invariant and leads to 
continuous distributions for n^oo, unlike the Polya 
tree model. We also don't have to tune Beta pa- 
rameters; the model tunes itself by suitably assign- 
ing high/low posterior probability to subdividing cells. 
While Polya trees form a natural conjugate prior class, 
our prior does not directly, but can be generalized to 
do so [Hut04a]. The computational complexity for 
the quantities of interest will be the same (essentially 
0(n)), i.e. as good as it could be. 

Formal and effective dimension. Formally our 
model is 2 • (2™ 1 — l)-dimensional, but the effective di- 
mension can by much smaller, since <f* is forced with a 
non-zero probability to a much smaller polytope, for in- 
stance with probability | to the zero-dimensional glob- 
ally uniform distribution. We will compute the effec- 
tive p(oste)rior dimension. 

3 QUANTITIES OF INTEREST 

The evidence recursion. At the end of Section 2 we 
defined our tree mixture model. The next step is to 
compute the standard quantities of interest defined at 
the beginning of Section 2. The evidence p{D) is key, 
the other quantities (posterior, predictive distribution, 
expected q(x) and its variance) follow then immedi- 
ately. Let D z := {x e D : x E T z } be the n z := \D Z | data 
points that lie in subtree T z . We compute p z (D z ) re- 
cursively for all ze_ZB™ -1 , which gives p{D)=p t {D t ). 
Inserting (1) and (7) into 



w(n zQ ,n zl ) := 



Pz(D z ) 



J Pz(Dz 



\q z *)p(q z *)dq z 



one can derive the following recursion [Hut04a] : 

Pz{D z ) 



1 ^ | P z o{D z o)p z i{D z i) i 
' w(n z0 ,n zl ) 



(8) 



(9) 



n,. 



n z0 \n zl \ 
n zQ + n zl , A z 



n z p 
n z 



The recursion terminates with p z (D z ) = 1 when l{z) = 
to. Recall (1) if you insist on a formal proof: For 
£(z) — to and x GT Z we have T x > = T z p z (x\q) = 1 
=>Pz(D x \q) = l=>p x (D x ) = l. 

Interpretation of (9): With probability \ 1 the evi- 
dence is uniform in T z . Otherwise data D z is split into 
two partitions of size n z0 and n z \ =n z — n z0 . First, 
choose n Z Q uniformly in {0,...,n z }. Second, given n z , 
choose uniformly among the ( ™* ) possibilities of se- 
lecting n z0 out of n z data points for T z0 (the remaining 
n z \ are then in Y z \). Third, distribute D z q according 
to p z o{D z0 ) and D z \ according to p z i(D zl ). Then, the 
evidence in case of a split is the second term in (9). 
The factor 2 Uz is due to our normalization convention 
(1). This also verifies that the r.h.s. yields the l.h.s. if 
integrated over all D z , as it should be. 

Discussing the weight. The relative probability of 
splitting (second term on r.h.s. of (9)) to the uniform 
case (first term in r.h.s. of (9)) is controlled by the 
weight w. Large (small) weight indicates a (non) uni- 
form distribution, provided p Z Q and p z \ are O(l). Bal- 
ance A z ss (76 0) indicates a (non) symmetric parti- 
tioning of the data among the left and right branch of 
T z . Asymptotically for large n z (keeping A z fixed), we 
have 

'2n z n -2n z Ai 



Assume that data D is sampled from the true distri- 
bution q. The probability of the left branch r z0 of T z 
is q z o = P[T z o\T z ,q] = 2 l q z (r z o). The relative frequen- 
cies asymptotically converge to q Z Q . More precisely 
^=™g z0 ±O(n- 1 / 2 ) with probability 1 (w.p.l). Simi- 
larly for the right branch. Assume the probabilities are 
equal (q Z Q =q z i = ^), possibly but not necessarily due 
to a uniform q z Q on T z . Then A z = (3(nJ 1 / 2 ), which 
implies 



w n ,(A z ) ~ Q(Vn z ) ~ * 00 if q z0 

w.p.l 



q z i 



consistent with our anticipation. Conversely, for q Z Q ^ 
q z i (which implies non-uniformity of q z Q) we have 
A z ^>c:=q Z Q — ^7^0, which implies 

w nz (A z ) ~ J*fe- 2n > c2 n ^°° if q zQ + q zl , 

V w.p.l 

again, consistent with our anticipation. 

Asymptotic convergence/consistency (n — yoo). 

For fixed m<oo, one can show that almost surely the 
posterior p z (q z *\D) concentrates around the true dis- 
tribution (f z „ for n^> 00. This implies that the posterior 
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p z {x\D z )—>q z (x) for all xGF z . One can also show that 
the evidence p z (-D z )— > | or 1 for uniform q z (), and in- 
creases exponentially with n z for non- uniform q z () (see 
[Hut04a] for proofs). 

Model dimension and cell number. As discussed 
in Section 2, the effective dimension of <f* is the number 
of components that are not forced to \ by (2). Note 
that a component may be "accidentally" \ in (4), but 
since this is an event of probability 0, we don't have 
to care about this subtlety. So the effective dimension 
N q z , =#{<? < S(f z * °f Qz* can be given recursively 

as 



to or q z0 = 5 
else 



(10) 



'0 if £{z) 

1 + JVy.o. + JV &1 . 

The effective dimension is zero if g z = |, since this im- 
plies that the whole tree r z has (fey = 5 due to (7). If 
q z 7^ § j we add the effective dimensions of subtrees F z0 
and T z i to the root degree of freedom q z o = q z — Qzi- 
Bayes' rule allows to represent the posterior probabil- 
ity that N$ zii = k as 

Pz[Nq z , = k\D z \p z {D z ) = J 5 Nszfk Pz{D z \q z *)p{q z *)dq z * 

where P z [...|...] :=P[...|r z ...], and 5 a b = l for a — b and 
else. The r.h.s. coincides with (8) except for the extra 
factor 5n ? k- Analogous to the evidence (8), using 
(10) we can prove the following recursion: 

P z [N ?z , = 0\D Z ] = 1 - g z {D z ), for I < m, 

P z [N^=k + l\D z ] = (11) 

k 

g z (D z )^P z0 [N ?z0t =i\D z0 ] ■ P zl [N ?zlt =k-i\D zl ], 



=0 



(12) 



Pz[N 9 „=k\D z ]=S k0 :={ 1 ^ Q } for I = ; 

, D \. = 1 Pzo(Dzo)Pzl(Dzl) (9) 1 _ 1 

2 Pz(D z )w(n z0 ,n zl ) 2p z (D z ) 

Read: The probability that tree T z has dimension fc+1 
equals the posterior probability g z (D z ) of splitting T z , 
times the probability that left subtree has dimension i, 
times the probability that right subtree has dimension 
k—i, summed over all possible i. 

Let us define a cell or bin as a maximal volume on 
which q() is constant. Then the model dimension is 1 
less than the number of bins (due to the probability 
constraint). Hence we also have a recursion for the 
distribution of the number of cells. 

Tree height and cell size. The effective height of 
tree <f z * at x € T z is also an interesting property. If 
<ZzO = \ or = Tn, then the height h^ zt (x) of tree <f z * 
at x is obviously zero. If q z o 7^ 5, we take the height of 
the subtree q Z x l+1 * that contains x and add 1: 







1 + hff. 



if 



£{z) = m 
else 



or q z0 



One can show that the tree height at x averaged over 
all trees <f z * is 

E z [h q ~Jx)\D z }^g z (D z )[l + E ZXl Jh q ^ i+i Xx)\D ZXl+1 ] 

where E z [f ? J...] = fP z [f q - z J...]p(q z *)dq z *. We may 
also want to compute the tree height averaged over 
all x G T z . For £(z) < m and q z0 ^|we get 



hq z ,i x )q( x \ T 'z)dx = l + q x0 -h$ z0 .+q z rhjj xlt , 



Ez[hJD z ] = g z (D z ) 



+ 



n z p+ 1 
n z + 2 
n zl + 1 
n z + 2 



E z o[hq z0 , \D z0 ] 
E zl [hf z JD zl ] 



with obvious interpretation: The expected height of a 
subtree is weighted by its relative importance, that is 
(an estimate of) its probability. The recursion termi- 
nates with E z [h q ~ z JD z ]=0 when £(z)=m. We can also 
compute intra and inter tree height variances. 

Finally consider the average cell size or volume 
v. Maybe more useful is to consider the logarithm 
— log 2 |r z |=£(z), since otherwise small volumes can get 
swamped in the expectation by a single large one. Log- 
volume Vf z , =£(z) if £(z) — m or q z — \, and else recur- 
sively v$ z , =qzoVq- z0t +q z iVq zlt . We can reduce this to 
the tree height, since v^ zt =h q * zt +£(z), in particular 

Vq, = hq, 

4 INFINITE TREES (m^oo) 

Motivation. We have chosen an (arbitrary) finite tree 
height m in our setup, needed to have a well-defined 
recursion start at the leaves of the trees. What we are 
really interested in are infinite trees (m=oo). Why not 
feel lucky with finite ml First, for continuous domain 
T (e.g. interval [0,1)), our tree model contains only 
piecewise constant models. The true distribution q() is 
typically non-constant and continuous (Beta, normal, 
...). Such distributions are outside a finite tree model 
class (but inside the infinite model), and the posterior 
p(x\D) cannot converge to the true distribution, since 
it is also piecewise constant. Hence all other estimators 
based on the posterior are also not consistent. Second, 
a finite m violates scale invariance (a non-informative 
prior on T z should be the same for all z, apart from 
scaling). Finally, having to choose the "right" m may 
be worrisome. 

For increasing to, the cells T x become smaller and 
will (normally) eventually contain either only a single 
data item, or be empty. It should not matter whether 
we further subdivide empty or singleton cells. So we 
expect inferences to be independent of to for sufficiently 
large to, or at least the limit to ^00 to exist. In this 
section we show that this is essentially true. 
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Prior inferences (D — (f>). Wc first consider the prior 
(zero data) case D = <f>. Recall that z E IB™ is some 
node and x G IB m a leaf node. Normalization implies 
Pz(4>) = 1 f° r au z i which is independent of to, hence 
the prior evidence exists for to — >oo. This is nice, but 
hardly surprising. 

The prior effective model dimension is more in- 
teresting. D = <f) implies D z = <fr implies n z = implies 
w(n z0 ,n z i) = 1 implies a 50/50 prior chance g z {4>) = \ 
for a split (see (12)). Recursion (11) reads 

k 

P z [JV f „ =fc + l] = |2 P,o[JV?,o. =*1 ' [^f.i. 

i=0 

with P z [JV^. = k] = 4o for Z = to and P z = 0] = § 
for I < m. So the recursion terminates in recursion 
depth min{fc+l,TO — /}. Hence P z [N^ zit =k+l] is the 
same for all to > l + k, which implies that the limit 
to^oo exists. Furthermore, recursion and termination 
are independent of z, hence also :—P z [N^ zt = k\. So 
we have to solve the recursion 

k 

ak + 1 — \ X! a i' a k-i witn a o = 5 (13) 

i=0 

The first few coefficients can be bootstrapped by hand: 
(2>8'T6'l28 '256 '1024' 2048 '•")■ ^ closed form can also 
be obtained: Inserting (13) into f(x) := X)^Lo afe2;fe+1 
we get f(x) = \[x + f 2 (x)} with solution f(x) = 1 — 
y/l — x, which has Taylor expansion coefficients 

(fflfe)fcGWo is a well-behaved distribution. It decreases 
fast enough to be a proper measure (^2 k ak = /(l) = 1 < 
oo), but too slow for the expectation E[N^] = Yl^k- 
cik = oo to exist. This is exactly how a proper non- 
informative prior on IV should look like: as uniform 
as possible, i.e. slowly decreasing. Further, P[N^ < 
oo]=^ fe afe = l implies P[N$ t <oo\D] = l, which shows 
that the effective dimension is almost surely finite, i.e. 
infinite (Polya) trees have probability zero. 

For the tree height we have E z [h$ zt (x)] =0 if l = m 
and otherwise 

E x [h s „{x)\ = ^ + E ZXl+1 [h q ^ l+1 ,(x)}} 

= ... = 1- (\) m - 1 -» 1 for to^oo 

This also implies that the expected average height 
E z [h ? J = 1- (±) m - 1 -» 1. This is the first case where 
the result is not independent of to for large finite to, 
but it converges for to^oo, what is enough for our 
purpose. 

Single data item D—(x). Since p(x) = 1 (by sym- 
metry and normalization) and w\ = 1 are the same as 



for the n — case, all prior n = 0, to — > oo results re- 
main valid for n= 1: g(x) = |, P[A^ = fc|a;] = a^, and 
£7[%.(a;)|a;]-»l. 

General D. We now consider general D. For con- 
tinuous spaces r and non-singular distribution g, the 
probability of observing the same point more than once 
(multi-points) is zero and hence can, to a certain ex- 
tend, be ignored. See [Hut04a] for a thorough workout 
of this case. In order to compute p(D) and other quan- 
tities, we recurse (9) down the tree until D z is either 
empty or a singleton D z — (x) G T z . We call the depth 
rrix :=t(z) at which this happens, the separation level. 
In this way, the recursion always terminates. For in- 
stance, for r=[0,l), if £:=min{|x J — x j \ :x t ^x i with 
gD} is the shortest distance, then TO x <log 2 f =: 
mo < oo, since e > 0. At the separation level we can 
insert the derived formulas for evidence, posterior, di- 
mension, and height. Note, there is no approximation 
here. The procedure is exact, since we analytically 
computed the infinite recursion for empty and single- 
ton D. 

So we have devised a finite procedure, linear in the 
data size n, for exactly computing all quantities of in- 
terest in the infinite Bayes tree. In the worst case, 
we have to recurse down to level m for each data 
point, hence our procedure has computational com- 
plexity O(n-TO ). For non-singular prior, the time is 
actually 0{n) with probability 1. So, inference in our 
mixture tree model is very fast. Posterior (weak) con- 
vergence/consistency for m—oo can be shown similarly 
to the to < oo case [Hut04a] . 

5 THE ALGORITHM 

What it computes. In the last two sections we 
derived all necessary formulas for making inferences 
with our tree model. Collecting pieces together we 
get the exact algorithm for infinite tree mixtures be- 
low. It computes the evidence p(D), the expected 
tree height E[h^ (x)\D] at x, the average expected tree 
height E[h^ \D], and the model dimension distribution 
P[Nf t \D\. It also returns the number of recursive func- 
tion calls, i.e. the size of the explicitly generated tree. 
The size is proportional to n for regular distributions 

The BayesTree algorithm (in pseudo C code) takes 
arguments (D\\,n,x,N); data array D[0..n— 1] € [0,1)™, 
a point xG-ZR, and an integer N . It returns (p,h,h,p\\,r); 
the logarithmic data evidence p=lnp(D), the expected 
tree height h=E[h^(x)\D] at x, the average expected 
tree height h=E[h^ \D], the model dimension distribu- 
tion p[0.. N— 1]=P[N^ = ..|D], and the number of recur- 
sive function calls r i.e. the size of the generated tree. 
Computation time is about iV 2 nlogn nano-seconds on 
a 1GHz P4 laptop. 
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BayesTree(U[],n,x,JV) 

[ if (n<l and (n==0 or L>[0]==x or x^[0,l))) 
|~ if (xe [0,1)) then h=l; else h = 0; 

ft = l; p = ln(l); r = l; 
[ for(fc = 0,..,JV-l)p[fe]=a fe ; /* see (13) */ 

else 

|~ n = ni = 0; 
for(i = 0,..,n— 1) 

[ if (£)[»]< i) then[D [n ]=2D[i]; n = n + l;] 

[ else [D 1 [n 1 ]=2D\i]-l; n 1 =n 1 + l;] 

(p , ^ o , fto ,Po [] ,»"o ) =BayesTree (D [] , n , 2x , N - 1 ) ; 

(piMM-.Pi D ,ri )=BayesTree(Di [] ,ni ,2x-l ,iV-l) ; 

t=p +Pi-lnw(n ,ni); 

if (f<100) then p = ln(i(l+cxp(t)); 

else p=t— ln(2); 
5 = l-|exp(-p); 

if (xe[0,l)) then ft = 5-(l+/i +/ii); else /i = 0; 

^=.9-(i+i^o+^i); 
p[0] = l-s; 

for(fc = 0,..,7V-l)p[fc+l]= 5 -E- = o? 5 oW-pi[fc-z]; 
L r = l+r +ri; 
L return (p,h,/i,p[],r); 

How algorithm BayesTree() works. Since evi- 
dence p(D) and weight l/w n can grow exponentially 
with n, we have to store and use their logarithms. 
So the algorithm returns p=lnp(D). In the n < 1 
branch, the closed form solutions p=lnp(<p) = ln(l), 
h=E[h^ (x)\4> or x] = l, h=E[h^ \D] = 1, andp[fc]=«fc 
have been used to truncate the recursion. If £) = 
(x 1 ) 7^ x, we have to recurse further until x falls in 
an empty interval. In this case or if n> 1 we partition 
D into points left and right of |. Then we rescale the 
points to [0,1) and store them in D and D\, respec- 
tively. Array D could have been reused (like in quick 
sort) without allocating two new arrays. Then, algo- 
rithm BayesTree() is recursively called for each parti- 
tion. The results are combined according to the recur- 
sions derived in Section 2. \mv can be computed from 
(9) via lnn! = ^^ =1 lnfc. (Practically, pre-tabulating a*, 
or n\ does not improve overall performance). For com- 
puting p we need to use ln(^(l+e*))=t— ln2 to machine 
precision for large t in order to avoid numerical over- 
flow. 

Remarks. Strictly speaking, the algorithm has run- 
time O(nlogn), since the sorting effectively runs once 
through all data at each level. If we assume that the 
data are presorted or the counts n z are given, then 
the algorithm is 0(n) [Hut04a]. The complete C code, 
available from [Hut04a], also handles multi-points. 



Note that x passed to BayesTree() is not and can- 
not be used to compute p(x\D). For this, one has to 
call BayesTreeQ twice, with D and (D,x), respectively. 
The quadratic order in N is due to the convolution, 
which could be reduced to O(NlogN) by transforming 
it to a scalar product in Fourier space with FFT. 

Multiply calling BayesTreeQ, e.g. for computing the 
predictive density function p(x\D) on a fine x-grid, is 
inefficient. But it is easy to see that if we once pre- 
compute the evidence p z (D z ) for all z up to the sepa- 
ration level in time 0(n), we can compute "local" quan- 
tities like p(x\D) at x in time O(logn). This is because 
only the branch containing x needs to be recursed, the 
other branch is immediately available, since it involves 
the already pre-computed evidence only. The predic- 
tive density p(x\D) — E[q(x)\D] and higher moments, 
the distribution function P[x < a\D], updating D by 
adding or removing one data item, and most other lo- 
cal quantities can be computed in time O(logn) by such 
a linear recursion. 

A good way of checking correctness of the implemen- 
tation and of the derived formulas, is to force some 
minimal recursion depth ml . The results must be in- 
dependent of m', since the closed- form speedups are 
exact and applicable anywhere beyond the separation 
level. 

Numerical example. To get further insight into the 
behavior of our model, wc numerically investigated 
some example distributions q(). We have chosen el- 
ementary functions, which can be regarded as proto- 
types for more realistic functions. They include the 
Beta, linear, a singular, piecewise constant distribu- 
tions with finite and infinite Bayes trees, and oth- 
ers. These examples on [0,1) also shed light on the 
other spaces discussed in Section 2, since they are iso- 
morphic. The posteriors, model dimensions, and tree 
heights, of the singular distribution q(x) = 2/y/l — x 
arc plotted in Figure 1 for random samples D of sizes 
n= 10°,...,10 5 . The posterior p{x\D) clearly converges 
for n — > oo to the true distribution qQ, accompanied 
by a (necessary) moderate growth of the effective di- 
mension. For n — 10 we show the data points. It is 
visible how each data point pulls the posterior up, as 
it should be ("one sample seldom comes alone"). The 
expected tree height E[h(x)\D] correctly reflects the 
local needs for (non)splits, i.e. is larger near the singu- 
larity at x = 1 . The other examples display a similar 
behavior (see [Hut04a]). 

6 DISCUSSION 

We presented a Bayesian model on infinite trees, where 
we split a node into two subtrees with prior probabil- 
ity and uniform choice of the probability assigned 
to each subtree. We devised closed form expressions 
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Figure 1: BayesTree() results for a prototypical proper singular distribution q(x) = 2/\/T—x. 



for various inferential quantities of interest at the data 
separation level, which led to an exact algorithm with 
runtime essentially linear in the data size. The theo- 
retical and numerical model behavior was very reason- 
able, e.g. consistency (no undcrfitting) and low finite 
effective dimension (no overfitting) . 

There are various natural generalizations of our 
model. The splitting probability pis) could be cho- 
sen different from ~, fc-ary trees could be allowed, 
and the uniform prior over subtrees could be gener- 
alized to Beta/Dirichlet distributions. We were pri- 
marily interested in the case of zero prior knowledge, 
hence zero model (hyper)parameters, but the general- 
izations above make the model flexible enough, in case 
prior knowledge needs to be incorporated. The depen- 
dency on p(s) is particularly interesting [Hut04a]. The 
expected entropy can also be computed by allowing 
fractional counts n z and noting that xlnx— ■4-X a \ a =i 
[Hut02]. A sort of maximum a posteriori (MAP) tree 
skeleton can also easily be read off from (9). A node 
r 2 in the MAP-like tree is a leaf iff p^(^o)p,i(o,i) < L 
A challenge is to generalize the model from piecewise 
constant to piecewise linear continuous functions, at 
least for T= [0,1). Independence of subtrees no longer 
holds, which was key in our analysis. 
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